In this project, we analyze which chemical properties influence the quality of white wines. We use descriptive analysis, summary statistics, exploratory data analysis and modelling techniques as we go through the analysis.
The original data set is available at the following location:
http://www3.dsi.uminho.pt/pcortez/wine/
Let's first define what we are trying to achieve. Our objective is to identify the chemical properties that influence the quality of white wine. To do this, we will use the data set made available through Udacity, which contains 4898 records with 11 input attributes and 1 output attribute.
Let's try to understand the data by going through what each variable means in this data set.
Attribute information:
Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Description of attributes:
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
First of all, let's load the libraries that we may need.
# suppress warnings globally; message and figure-size settings are chunk options and are set through knitr::opts_chunk further down
options(warn=-1)
suppressMessages(library(ggplot2)) # To draw Plots
suppressMessages(library(tidyr)) # To wrangle our data, if required
suppressMessages(library(dplyr)) # To wrangle our data,if required
suppressMessages(library(GGally)) # to draw our scatterplot matrix
suppressMessages(library(scales))
suppressMessages(library(memisc))
suppressMessages(library(gridExtra)) # To grid multiple plots
suppressMessages(library(corrplot)) # to plot the correlation matrix
Let's set the global options for all our plots and set warning and message to FALSE so that we don't see unnecessary messages in our final output.
knitr::opts_chunk$set(fig.width=12, fig.height=8, fig.path='Figs/',
warning=FALSE, message=FALSE)
Let's load our data set. We will read the CSV file.
#read the csv file (Give path in file.path variable)
white_wine <- read.csv(file= file.path("E:","/DataScienceWithR/Nano Degree Udacity/Projects/Project 4/wineQualityWhites.csv"))
Now our data is loaded into the white_wine data frame. Let's take a closer look at it.
#glimpse is a function from the dplyr package. It is similar to str but also shows the first few values of each variable
glimpse(white_wine)
## Observations: 4,898
## Variables: 13
## $ X (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...
## $ fixed.acidity (dbl) 7.0, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7.0, 6...
## $ volatile.acidity (dbl) 0.27, 0.30, 0.28, 0.23, 0.23, 0.28, 0.32,...
## $ citric.acid (dbl) 0.36, 0.34, 0.40, 0.32, 0.32, 0.40, 0.16,...
## $ residual.sugar (dbl) 20.70, 1.60, 6.90, 8.50, 8.50, 6.90, 7.00...
## $ chlorides (dbl) 0.045, 0.049, 0.050, 0.058, 0.058, 0.050,...
## $ free.sulfur.dioxide (dbl) 45, 14, 30, 47, 47, 30, 30, 45, 14, 28, 1...
## $ total.sulfur.dioxide (dbl) 170, 132, 97, 186, 186, 97, 136, 170, 132...
## $ density (dbl) 1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0...
## $ pH (dbl) 3.00, 3.30, 3.26, 3.19, 3.19, 3.26, 3.18,...
## $ sulphates (dbl) 0.45, 0.49, 0.44, 0.40, 0.40, 0.44, 0.47,...
## $ alcohol (dbl) 8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8,...
## $ quality (int) 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 7,...
To better understand our data, let's compute some summary statistics. This step helps us build a better understanding of the data and detect any flaws in it, if present.
#get the summary for all the variables in the data set.
summary(white_wine)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
We see that the mode of quality is 6.
Let's see the distribution of the rest of the data as well.
#use geom_histogram to draw the histograms
v1<-ggplot(aes(fixed.acidity),data=white_wine)+geom_histogram()+
ggtitle("Fixed.acidity Distribution")
v2<-ggplot(aes(volatile.acidity),data=white_wine)+geom_histogram()+
ggtitle("volatile.acidity Distribution")
v3<-ggplot(aes(citric.acid),data=white_wine)+geom_histogram()+
ggtitle("citric.acid Distribution")
v4<-ggplot(aes(residual.sugar),data=white_wine)+geom_histogram()+
ggtitle("residual.sugar Distribution")
v5<-ggplot(aes(chlorides),data=white_wine)+geom_histogram()+
ggtitle("chlorides Distribution")
v6<-ggplot(aes(free.sulfur.dioxide),data=white_wine)+geom_histogram()+
ggtitle("free.sulfur.dioxide Distribution")
v7<-ggplot(aes(total.sulfur.dioxide),data=white_wine)+geom_histogram()+
ggtitle("total.sulfur.dioxide Distribution")
v8<-ggplot(aes(density),data=white_wine)+geom_histogram()+
ggtitle("Density Distribution")
v9<-ggplot(aes(pH),data=white_wine)+geom_histogram()+
ggtitle("pH Distribution")
v10<-ggplot(aes(sulphates),data=white_wine)+geom_histogram()+
ggtitle("Sulphates Distribution")
v11<-ggplot(aes(alcohol),data=white_wine)+geom_histogram()+
ggtitle("(%) Alcohol Distribution")
#lets arrange all the histograms on a single grid
grid.arrange(v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11)
As we suspected, most of the data is approximately normally distributed, except residual.sugar and alcohol.
Let's analyze these distributions further. Most of the data in the residual.sugar column lies within the range 0-20, so let's zoom in on the plot by transforming the x axis.
# I have used a log10 transformation on the x axis and geom_freqpoly to draw the frequency polygon and see the distribution better.
ggplot(aes(residual.sugar),data=white_wine)+geom_histogram(fill="grey")+
scale_x_log10()+
geom_freqpoly()+
ggtitle("residual.sugar(log10) distribution")
Interesting: we see a bimodal distribution for the residual.sugar variable.
Another observation is that many features, such as sulphates, alcohol and volatile.acidity, contain some outliers, which affect the shape of the curve. So let's remove these outliers and plot only the data below the 99th percentile.
# I have subset the data set using the subset and quantile functions to remove the top 1% of records
v1<-ggplot(aes(fixed.acidity),
data=subset(white_wine,white_wine$fixed.acidity<
quantile(white_wine$fixed.acidity,0.99)))+
geom_histogram()+
ggtitle("Fixed.acidity distribution(99 quantile)")
v2<-ggplot(aes(volatile.acidity),
data=subset(white_wine,white_wine$volatile.acidity<
quantile(white_wine$volatile.acidity,0.99)))+
geom_histogram()+
ggtitle("volatile.acidity distribution(99 quantile)")
v3<-ggplot(aes(citric.acid),
data=subset(white_wine,white_wine$citric.acid<
quantile(white_wine$citric.acid,0.99)))+
geom_histogram()+
ggtitle("citric.acid distribution(99 quantile)")
v4<-ggplot(aes(residual.sugar),
data=subset(white_wine,white_wine$residual.sugar<
quantile(white_wine$residual.sugar,0.99)))+
geom_histogram()+
scale_x_log10()+
ggtitle("residual.sugar(log10) distribution(99 quantile)")
v5<-ggplot(aes(chlorides),
data=subset(white_wine,white_wine$chlorides<
quantile(white_wine$chlorides,0.99)))+
geom_histogram()+
ggtitle("chlorides distribution(99 quantile)")
v6<-ggplot(aes(free.sulfur.dioxide),
data=subset(white_wine,white_wine$free.sulfur.dioxide<
quantile(white_wine$free.sulfur.dioxide,0.99)))+
geom_histogram()+
ggtitle("free.sulfur.dioxide distribution(99 quantile)")
v7<-ggplot(aes(total.sulfur.dioxide),
data=subset(white_wine,white_wine$total.sulfur.dioxide<
quantile(white_wine$total.sulfur.dioxide,0.99)))+
geom_histogram()+
ggtitle("total.sulfur.dioxide distribution(99 quantile)")
v8<-ggplot(aes(density),
data=subset(white_wine,white_wine$density<
quantile(white_wine$density,0.99)))+
geom_histogram()+
ggtitle("density distribution(99 quantile)")
v9<-ggplot(aes(pH),
data=subset(white_wine,white_wine$pH<quantile(white_wine$pH,0.99)))+
geom_histogram()+
ggtitle("pH distribution(99 quantile)")
v10<-ggplot(aes(sulphates),
data=subset(white_wine,white_wine$sulphates<
quantile(white_wine$sulphates,0.99)))+
geom_histogram()+
ggtitle("sulphates distribution(99 quantile)")
v11<-ggplot(aes(alcohol),
data=subset(white_wine,white_wine$alcohol<
quantile(white_wine$alcohol,0.99)))+
geom_histogram()+
ggtitle("(%) alcohol distribution(99 quantile)")
# lets arrange the plots on a single grid
grid.arrange(v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11)
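The percentile filter used in these plots can be checked on a small vector. The sketch below uses made-up values (not taken from the wine data) and only base R; the names x, cutoff and trimmed are illustrative.

```r
# Toy values, made up for illustration: 99 ordinary points and one extreme point
x <- c(1:99, 1000)

# 99th percentile, computed with quantile()'s default method
cutoff <- quantile(x, 0.99)

# Keep only values strictly below the cutoff, as the subset() calls do
trimmed <- x[x < cutoff]

length(x)        # number of values before filtering
length(trimmed)  # number of values after filtering
max(trimmed)     # largest value that survives the filter
```

Because the comparison is strict (`<`), any value exactly equal to the cutoff is dropped as well, so the filter can remove slightly more than 1% of the rows when there are ties at the cutoff.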
From the data dictionary, we learned that total sulfur dioxide is the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
So let's create another variable, bound sulfur dioxide, and observe its distribution.
# I have used the mutate function from the dplyr package to create a new variable bound_SO2
white_wine <- white_wine %>%
mutate(bound_SO2 = total.sulfur.dioxide-free.sulfur.dioxide)
Now we have a new column bound_SO2 in our original data set. Let's see the distribution of the data in this new column.
# I have subset the dataset using subset and quantile function to remove top 1% of records
ggplot(aes(bound_SO2),
data=subset(white_wine,white_wine$bound_SO2<
quantile(white_wine$bound_SO2,0.99)))+
geom_histogram()+
ggtitle("bound_SO2 distribution(99 quantile)")
As expected, this follows the same distribution as its parent variables because it is derived from those variables.
We should also convert the quality column, which is currently an integer, into a factor variable, given the usage and description of this variable.
# I have used the as.factor function to convert the numeric column into a factor
white_wine$quality <- as.factor(white_wine$quality)
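One thing to keep in mind after this conversion: calling as.numeric() directly on a factor returns the internal level index, not the original score. A minimal sketch with made-up scores (the names q, codes and scores are illustrative):

```r
# Toy quality scores, made up for illustration
q <- factor(c(3, 5, 6, 6, 9))

# as.numeric() on a factor gives the position of each value among the sorted levels
codes <- as.numeric(q)

# Going through as.character() recovers the original scores
scores <- as.numeric(as.character(q))

levels(q)
codes
scores
```

For the wine data the levels are the consecutive scores 3 to 9, so the level index is simply the score minus 2. Correlations are unaffected by such a shift, but axis values on a plot are.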
There are 4898 observations and 14 variables (“X”, “fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol”, “quality”, and the derived “bound_SO2”). Quality can be considered an ordered factor variable with 7 distinct levels, which represent the quality of the wine on a scale of 0-10. In the actual data set wine quality is between 3 and 9, 3 being the worst and 9 being the best. If we look at the distribution of the quality column, we see that 2198 wines have quality 6 and 1457 wines have quality 5. Very few (9) wines are of the highest quality (9) and very few (20) wines are of the lowest quality (3).
Other observations:
1. We observed that residual.sugar follows a bimodal distribution.
2. Many columns contain outliers, which need to be considered before performing any modeling.
3. Since this is a tidy data set, we did not see any inconsistencies in the data, such as missing values.
4. Most of the variables follow an approximately normal distribution.
5. Variable X only numbers the observations, so it can be excluded from any analysis.
6. There are 19 records with citric.acid equal to 0, which implies that citric acid was not present in these wines. Given the description of the citric.acid variable, it will be interesting to see how it impacts the quality perceived by the consumer.
7. Given the values of residual sugar, there is only one wine which can be considered sweet.
8. The IQR for volatile acidity, which can cause an unpleasant taste, is 0.11 and the median is 0.26. It will be interesting to see if there is any perceived impact of high volatile acidity on the quality. There are 156 wines with volatile.acidity > 0.5, which is comparatively higher than the rest of the wines.
##What is/are the main feature(s) of interest in your dataset?
I am trying to see the impact of various attributes on the quality: how does each attribute impact the quality of the wine? Based on the descriptive analysis and initial intuition, I would like to explore the impact of alcohol, volatile.acidity, residual.sugar and citric.acid on the wine quality, even though I have not yet analyzed the correlation of these variables with quality.
At this point, it is very hard to judge what other parameters would be useful in our investigation, but we will uncover the relevant parameters as we progress.
I created a new variable bound_SO2 using the formula bound_SO2 = total.sulfur.dioxide - free.sulfur.dioxide, as described in the data dictionary.
Other than this variable, I transformed the quality variable into a factor variable so that it is easier to investigate.
When I first plotted histograms for the variables, I immediately noticed the presence of outliers, which was evident from the summary as well. So I created histograms excluding the top 1% of the data.
Most of the features followed a normal distribution except the residual.sugar variable. Another transformation was related to the residual.sugar feature: I had to transform the x axis using log10 to see that this variable follows a bimodal distribution. I also used geom_freqpoly() to better reveal the shape of the distribution.
Additionally, I changed the breaks on the y axis while plotting the quality histogram in order to see the exact boundaries of the bars.
Next, let's draw a correlation matrix to see the correlation among the different variables.
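As a rough sketch of how such a matrix can be computed, the example below builds a tiny data frame with made-up values (not the wine data) and calls base R's cor(). The names toy and M are illustrative. In the real data, quality would first need to be converted back to numeric scores, since cor() only accepts numeric columns.

```r
# Toy data, made up for illustration only
toy <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 11.4, 12.8),
  density = c(1.0010, 0.9970, 0.9951, 0.9920, 0.9900),
  quality = c(5, 5, 6, 7, 8)
)

# Pearson correlation between every pair of columns
M <- cor(toy)

# Rounded for easier reading
round(M, 2)
```

A matrix like M can then be passed to a plotting function such as corrplot() from the corrplot package loaded earlier.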
From the correlation matrix, we see that alcohol has a comparatively stronger, positive correlation with wine quality. The other attributes either do not have a strong correlation or are negatively correlated with wine quality. For example, density is negatively correlated with wine quality.
Let's explore this further by plotting alcohol against quality.
# I have subset the data set using the subset and quantile functions to remove the top 1% of records
ggplot(aes(y=alcohol,x=quality),
data=subset(white_wine,white_wine$alcohol<
quantile(white_wine$alcohol,0.99)))+
geom_boxplot()+
xlab("quality")+
ggtitle("Quality by (%) alcohol level")
Let's draw a line plot so that we can see the trend more clearly. We will also add geom_point() to the same graph to see how the points cluster.
# I have used stat=summary with the median function on the y axis and have subset the data set using the subset and quantile functions to remove the top 1% of records. quality is a factor, so it is converted back to its numeric scores through as.character()
ggplot(aes(y=alcohol,x=as.numeric(as.character(quality))) ,
data=subset(white_wine,white_wine$alcohol<
quantile(white_wine$alcohol,0.99)))+
geom_point(alpha=1/4,position="jitter")+
geom_line(stat="summary",fun.y=median)+
scale_x_continuous(breaks=seq(1,9,1))+
xlab("Quality")+
ggtitle("Plot for median Alcohol(%) against quality")
From the graph, we observe that higher-quality wines tend to have a higher alcohol level.
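The summary line in such a plot is just the median of the y variable within each quality score. A minimal sketch with made-up values shows the same computation in base R using tapply(); the names toy and med are illustrative.

```r
# Toy data, made up for illustration only
toy <- data.frame(
  quality = factor(c(5, 5, 5, 6, 6, 7, 7, 7)),
  alcohol = c(9.0, 9.4, 9.8, 10.0, 10.6, 11.0, 11.5, 12.0)
)

# Median alcohol within each quality level
med <- tapply(toy$alcohol, toy$quality, median)
med
```

Printing these per-group medians next to the plot is a quick way to confirm what the line is showing.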
Let's see the relationship between density and quality.
# I have subset the data set using the subset and quantile functions to remove the top 1% of records
ggplot(aes(y=density,x=quality),
data=subset(white_wine,white_wine$density<
quantile(white_wine$density,0.99)))+
geom_boxplot()+
xlab("Quality")+
ggtitle("Quality by density")
Let's draw a line plot so that we can see the trend more clearly. We will also add geom_point() to the same graph to see how the points cluster.
# I have used stat=summary with the median function on the y axis and have subset the data set using the subset and quantile functions to remove the top 1% of records. quality is a factor, so it is converted back to its numeric scores through as.character()
ggplot(aes(y=density,x=as.numeric(as.character(quality))),
data=subset(white_wine,white_wine$density<
quantile(white_wine$density,0.99)))+
geom_point(alpha=1/4,position="jitter")+
geom_line(stat="summary",fun.y=median)+
scale_x_continuous(breaks=seq(1,9,1))+
xlab("Quality")+
ggtitle("Quality by Density with summary stats=median")
I want to see the relationship of each variable with quality, so let's plot each variable against quality using stat=summary and the median function.
# I have used stat=summary with the median function on the y axis and removed the top 1% of records of each variable with quantile. quality is a factor, so it is converted back to its numeric scores through as.character(), and scale_x_continuous marks the breaks on the x axis.
# The ten plots differ only in the variable, so a small helper builds each one.
median_by_quality <- function(var,title=paste("Quality by",var)){
  d <- data.frame(quality=as.numeric(as.character(white_wine$quality)),
                  value=white_wine[[var]])
  # keep only the rows below the 99th percentile of this variable
  d <- subset(d,d$value<quantile(d$value,0.99))
  ggplot(aes(y=value,x=quality),data=d)+
    geom_line(stat="summary",fun.y=median)+
    scale_x_continuous(breaks=seq(1,9,1))+
    xlab("Quality")+
    ylab(var)+
    ggtitle(title)
}
s1 <- median_by_quality("fixed.acidity")
s2 <- median_by_quality("volatile.acidity")
s3 <- median_by_quality("citric.acid")
s4 <- median_by_quality("chlorides")
s5 <- median_by_quality("total.sulfur.dioxide")
s6 <- median_by_quality("density")
s7 <- median_by_quality("pH")
s8 <- median_by_quality("sulphates")
s9 <- median_by_quality("alcohol","Quality by (%) alcohol")
s10 <- median_by_quality("bound_SO2")
# let's arrange all the plots on the same grid.
grid.arrange(s1,s2,s3,s4,s5,s6,s7,s8,s9,s10)
##
## Pearson's product-moment correlation
##
## data: as.numeric(white_wine$quality) and white_wine$chlorides
## t = -15.024, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2365501 -0.1830039
## sample estimates:
## cor
## -0.2099344
We notice that chlorides seem to have a negative correlation with the quality of wine. This is also evident from the correlation coefficient between these two variables, which is -0.21.
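The Pearson output above has the shape produced by base R's cor.test(). The sketch below runs the same kind of test on made-up values (not the wine data); the names quality, chlorides and res are illustrative.

```r
# Toy data, made up for illustration only
quality   <- factor(c(5, 6, 6, 7, 5, 8, 6, 4))
chlorides <- c(0.050, 0.045, 0.040, 0.035, 0.055, 0.030, 0.044, 0.060)

# Pearson correlation test on the numeric quality scores
res <- cor.test(as.numeric(as.character(quality)), chlorides)

res$estimate   # sample correlation
res$conf.int   # 95 percent confidence interval
res$p.value    # p-value for the null hypothesis of zero correlation
```

Note that the output above was computed on as.numeric(white_wine$quality), that is, on the factor's level index. Because the quality levels 3 to 9 are consecutive, the index is just the score shifted by a constant, and Pearson's correlation is unchanged by such a shift.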
It is also a good idea to see whether the other variables are correlated with each other.
# let's compute the correlation matrix for all the variables except quality.
cor(white_wine[,!names(white_wine) %in% c("quality")])
## X fixed.acidity volatile.acidity
## X 1.000000000 -0.25581431 0.002857966
## fixed.acidity -0.255814305 1.00000000 -0.022697290
## volatile.acidity 0.002857966 -0.02269729 1.000000000
## citric.acid -0.149899918 0.28918070 -0.149471811
## residual.sugar 0.006623775 0.08902070 0.064286060
## chlorides -0.045645192 0.02308564 0.070511571
## free.sulfur.dioxide -0.011928911 -0.04939586 -0.097011939
## total.sulfur.dioxide -0.161979037 0.09106976 0.089260504
## density -0.185976097 0.26533101 0.027113845
## pH -0.115774132 -0.42585829 -0.031915368
## sulphates 0.009807759 -0.01714299 -0.035728147
## alcohol 0.213656245 -0.12088112 0.067717943
## bound_SO2 -0.192413361 0.13566071 0.156769227
## citric.acid residual.sugar chlorides
## X -0.14989992 0.006623775 -0.04564519
## fixed.acidity 0.28918070 0.089020701 0.02308564
## volatile.acidity -0.14947181 0.064286060 0.07051157
## citric.acid 1.00000000 0.094211624 0.11436445
## residual.sugar 0.09421162 1.000000000 0.08868454
## chlorides 0.11436445 0.088684536 1.00000000
## free.sulfur.dioxide 0.09407722 0.299098354 0.10139235
## total.sulfur.dioxide 0.12113080 0.401439311 0.19891030
## density 0.14950257 0.838966455 0.25721132
## pH -0.16374821 -0.194133454 -0.09043946
## sulphates 0.06233094 -0.026664366 0.01676288
## alcohol -0.07572873 -0.450631222 -0.36018871
## bound_SO2 0.10217934 0.344844495 0.19379550
## free.sulfur.dioxide total.sulfur.dioxide density
## X -0.0119289106 -0.161979037 -0.18597610
## fixed.acidity -0.0493958591 0.091069756 0.26533101
## volatile.acidity -0.0970119393 0.089260504 0.02711385
## citric.acid 0.0940772210 0.121130798 0.14950257
## residual.sugar 0.2990983537 0.401439311 0.83896645
## chlorides 0.1013923521 0.198910300 0.25721132
## free.sulfur.dioxide 1.0000000000 0.615500965 0.29421041
## total.sulfur.dioxide 0.6155009650 1.000000000 0.52988132
## density 0.2942104109 0.529881324 1.00000000
## pH -0.0006177961 0.002320972 -0.09359149
## sulphates 0.0592172458 0.134562367 0.07449315
## alcohol -0.2501039415 -0.448892102 -0.78013762
## bound_SO2 0.2635372837 0.922482350 0.50444690
## pH sulphates alcohol bound_SO2
## X -0.1157741316 0.009807759 0.21365624 -0.192413361
## fixed.acidity -0.4258582910 -0.017142985 -0.12088112 0.135660713
## volatile.acidity -0.0319153683 -0.035728147 0.06771794 0.156769227
## citric.acid -0.1637482114 0.062330940 -0.07572873 0.102179337
## residual.sugar -0.1941334540 -0.026664366 -0.45063122 0.344844495
## chlorides -0.0904394560 0.016762884 -0.36018871 0.193795498
## free.sulfur.dioxide -0.0006177961 0.059217246 -0.25010394 0.263537284
## total.sulfur.dioxide 0.0023209718 0.134562367 -0.44889210 0.922482350
## density -0.0935914935 0.074493149 -0.78013762 0.504446902
## pH 1.0000000000 0.155951497 0.12143210 0.003143387
## sulphates 0.1559514973 1.000000000 -0.01743277 0.135693943
## alcohol 0.1214320987 -0.017432772 1.00000000 -0.426923036
## bound_SO2 0.0031433874 0.135693943 -0.42692304 1.000000000
From the correlation matrix, we observe that there is a strong positive correlation between density and residual sugar, which makes sense as sugar content increases the density of a liquid. Also, there is a strong negative correlation between density and alcohol content, and there is a positive correlation between density and total.sulfur.dioxide. Let's draw these 3 plots.
# I have used stat=summary with the median function on the y axis and have subset the data set using the subset and quantile functions to remove the top 1% of records. I have also used geom_smooth to draw a smoothed line through the data.
bivariate_1<- ggplot(aes(y=alcohol,x=density),
data=subset(white_wine,white_wine$density<
quantile(white_wine$density,0.99)))+
geom_line(colour="orange",stat="summary",fun.y=median)+
geom_smooth()+
ggtitle("Alcohol by Density(median)")
bivariate_2 <- ggplot(aes(y=residual.sugar,x=density) ,
data=subset(white_wine,white_wine$density<
quantile(white_wine$density,0.99)))+
geom_line(colour="orange",stat="summary",fun.y=median)+
geom_smooth()+
ggtitle("Residual.sugar by Density(median)")
bivariate_3 <- ggplot(aes(y=total.sulfur.dioxide,x=density),
data=subset(white_wine,white_wine$density<
quantile(white_wine$density,0.99)))+
geom_line(colour="orange",stat="summary",fun.y=median)+
geom_smooth()+
ggtitle("total.sulfur.dioxide by Density(median)")
#lets draw all the plots on the single grid
grid.arrange(bivariate_1,bivariate_2,bivariate_3)
It is also important to note that residual sugar and alcohol are negatively correlated. Let's draw the plot to see the relationship.
# I have used stat=summary with the median function on the y axis and have subset the data set using the subset and quantile functions to remove the top 1% of records. I have also used geom_smooth to draw a smoothed line through the data.
ggplot(aes(y=residual.sugar,x=alcohol),
data=subset(white_wine,white_wine$alcohol<
quantile(white_wine$alcohol,0.99)))+
geom_point(alpha=1/4,position="jitter")+
geom_line(colour="orange",stat="summary",fun.y=median)+
geom_smooth()+
ggtitle("(%)Alcohol by residual.sugar (summary statistic=median)")
Let’s also see the relationship between total.sulfur.dioxide and alcohol
# I have used stat=summary with the median function on the y axis and have subset the data set using the subset and quantile functions to remove the top 1% of records. I have also used geom_smooth to draw a smoothed line through the data.
ggplot(aes(y=total.sulfur.dioxide,x=alcohol) ,
data=subset(white_wine,white_wine$alcohol<
quantile(white_wine$alcohol,0.99)))+
geom_line(colour="orange",stat="summary",fun.y=median)+
geom_point(alpha=1/4,position="jitter")+
geom_smooth()+
ggtitle("(%)alcohol by total.sulfur.dioxide(summary statistic=median)")
Another important negative correlation is between pH and fixed acidity.
# I have used stat=summary with the median function on the y axis and have subset the data set using the subset and quantile functions to remove the top 1% of alcohol records. I have also used geom_smooth to draw a smoothed line through the data.
ggplot(aes(y=fixed.acidity,x=pH) ,
data=subset(white_wine,white_wine$alcohol<
quantile(white_wine$alcohol,0.99)))+
geom_line(colour="orange",stat="summary",fun.y=median)+
geom_point(alpha=1/4,position="jitter")+
geom_smooth()+
ggtitle("pH by fixed.acidity(summary statistic=median)")
I started my analysis by looking for features that correlate well with the quality of the wine. I found that alcohol has a good positive correlation with quality. Other than alcohol, chlorides seem to have a negative correlation with quality. The rest of the features are not very strongly correlated with quality.
Apart from the correlation with quality, I also analyzed the correlation between the alcohol, density and residual.sugar variables. I found that:
1. There is a strong positive correlation between density and residual sugar.
2. There is a strong negative correlation between density and alcohol content.
3. There is a positive correlation between density and total.sulfur.dioxide.
4. Residual.sugar and alcohol are negatively correlated.
Additionally, I observed the negative correlation between pH and fixed acidity.
I observed some interesting correlations of density with alcohol, residual.sugar and total.sulfur.dioxide. As stated in the section above:
1. There is a strong positive correlation between density and residual sugar.
2. There is a strong negative correlation between density and alcohol content.
3. There is a positive correlation between density and total.sulfur.dioxide.
4. Residual.sugar and alcohol are negatively correlated.
5. I also noticed the spurious correlation between bound_SO2 and total.sulfur.dioxide, which arises because bound_SO2 is derived from total.sulfur.dioxide.
The strongest correlation is between residual.sugar and density, at 0.838, which makes sense as sugar content has a large impact on the density of any liquid.
Since we don't have any categorical variables in this data set, we will create some and then do the multivariate analysis. I will create categorical variables based on the quartile values of the variables and label the intervals Low, Med, High and Extreme respectively.
1. chlorides -> chloride_cat
2. volatile.acidity -> volatile_acidity_cat
3. density -> density_cat
4. residual.sugar -> residual.sugar_cat
#cut the variable by the quantile range and assign the labels Low, Med, High and Extreme to the ranges
white_wine$chloride_cat <- cut(white_wine$chlorides,
breaks=c(quantile(white_wine$chlorides)),
labels=c("Low","Med","High","Extreme"),
include.lowest=TRUE)
#cut the variable by the quantile range and assign the labels Low, Med, High and Extreme to the ranges
white_wine$volatile_acidity_cat <- cut(white_wine$volatile.acidity,
breaks=c(quantile(white_wine$volatile.acidity)),
labels=c("Low","Med","High","Extreme"),
include.lowest=TRUE)
#cut the variable by the quantile range and assign the labels Low, Med, High and Extreme to the ranges
white_wine$density_cat <- cut(white_wine$density,
breaks=c(quantile(white_wine$density)),
labels=c("Low","Med","High","Extreme"),
include.lowest=TRUE)
#cut the variable by the quantile range and assign the labels Low, Med, High and Extreme to the ranges
white_wine$residual.sugar_cat <- cut(white_wine$residual.sugar,
breaks=c(quantile(white_wine$residual.sugar)),
labels=c("Low","Med","High","Extreme"),
include.lowest=TRUE)
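To see what this binning does, here is a minimal sketch on a made-up vector (not the wine data). quantile() with its defaults returns the minimum, the three quartiles and the maximum, and cut() turns those five break points into four intervals. The names x, breaks and cat4 are illustrative.

```r
# Toy values, made up for illustration only
x <- c(0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09)

# Five break points: 0%, 25%, 50%, 75%, 100%
breaks <- quantile(x)

# include.lowest = TRUE keeps the minimum value inside the first interval;
# without it the smallest observation would become NA
cat4 <- cut(x, breaks = breaks,
            labels = c("Low", "Med", "High", "Extreme"),
            include.lowest = TRUE)

table(cat4)
```

One caveat with this approach: cut() requires unique break points, so it stops with an error if two quartiles of a variable happen to be equal.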
Now that we have our 4 new categorical variables, let's do some multivariate analysis.
#I have used an alpha level and jitter in this plot to make it more readable, as there was a lot of overplotting in the original plot. quality is a factor, so it is converted back to its numeric scores through as.character()
ggplot(aes(x=as.numeric(as.character(quality)),y=alcohol,color=density_cat),data=white_wine)+
geom_point(alpha=1/3,position="jitter",size=3)+
scale_x_continuous(breaks=seq(1,9,1))+xlab("Quality")+
scale_color_discrete(name="Density")+
ggtitle("Quality by (%)Alcohol and density")
We see that for low density the alcohol level is high, and quality increases with the alcohol level.
Let's draw a line plot to see it more clearly.
ggplot(aes(x=as.numeric(as.character(quality)),y=alcohol,color=density_cat) ,data=white_wine)+
geom_line(stat="summary",fun.y=median)+
geom_point(alpha=1/4,position="jitter")+
scale_x_continuous(breaks=seq(1,9,1))+
scale_color_discrete(name="Density")+
xlab("Quality")+
ggtitle("Quality by (%)Alcohol(summary median) and density")
Above plot lines makes sense as alcohal and density have negative corelation.
Now let's use volatile.acidity instead of density.
# Alpha and jitter are used here to reduce the heavy overplotting seen in the original plot
ggplot(aes(x=as.numeric(quality),y=alcohol,color=volatile_acidity_cat),
data=white_wine)+geom_point(alpha=1/3,position="jitter",size=3)+
scale_x_discrete(breaks=seq(1,9,1))+
scale_color_discrete(name="volatile.acidity")+
xlab("Quality")+
ggtitle("Quality by (%)Alcohol and volatile acidity")
To see the relationship better, let's draw a line plot for each category.
ggplot(aes(x=as.numeric(quality),y=alcohol,color=volatile_acidity_cat) ,
data=white_wine)+geom_line(stat="summary",fun.y=median)+
scale_x_discrete(breaks=seq(1,9,1))+
scale_color_discrete(name="volatile.acidity")+
xlab("Quality")+
ggtitle("Quality by (%)Alcohol(summary median) and volatile acidity")
The plot shows the relationship between alcohol, volatile.acidity and quality: as volatile.acidity increases, quality and alcohol level increase.
Now let's use chlorides instead of volatile.acidity.
# Alpha and jitter are used here to reduce the heavy overplotting seen in the original plot
ggplot(aes(x=as.numeric(quality),y=alcohol,color=chloride_cat) ,
data=white_wine)+
geom_point(alpha=1/3,position="jitter",size=3)+
scale_x_discrete(breaks=seq(1,9,1))+
scale_color_discrete(name="chlorides")+
xlab("Quality")+
ggtitle("Quality by (%)Alcohol and chlorides")
To see the relationship better, let's draw a line plot for each category.
ggplot(aes(x=as.numeric(quality),y=alcohol,color=chloride_cat) ,
data=white_wine)+
geom_line(stat="summary",fun.y=median)+
geom_point(alpha=1/4,position="jitter")+
scale_x_discrete(breaks=seq(1,9,1))+
scale_color_discrete(name="chlorides")+
xlab("Quality")+
ggtitle("Quality by (%)Alcohol(summary median) and chlorides")
Now let's use residual.sugar.
# Alpha and jitter are used here to reduce the heavy overplotting seen in the original plot
ggplot(aes(x=as.numeric(quality),y=alcohol,color=residual.sugar_cat) ,
data=white_wine)+
geom_point(alpha=1/3,position="jitter",size=3)+
scale_x_discrete(breaks=seq(1,9,1))+
scale_color_discrete(name="residual.sugar")+
xlab("Quality")+
ggtitle("Quality by (%)Alcohol and Residual sugar")
# The line layer uses stat="summary" with the median function on the y axis
ggplot(aes(x=as.numeric(quality),y=alcohol,color=residual.sugar_cat) ,
data=white_wine)+
geom_point(alpha=1/4,position="jitter")+
geom_line(stat="summary",fun.y=median)+
scale_x_discrete(breaks=seq(1,9,1))+
scale_color_discrete(name="residual.sugar")+
xlab("Quality")+
ggtitle("Quality by (%)Alcohol(summary median) and residual.sugar")
We see that at Extreme levels of residual.sugar, alcohol is lower. This makes sense: one adds sweetness and the other adds a little bitterness to the flavor.
Now let's include a fourth variable in the plot by faceting.
# Facet the plot by chloride_cat
ggplot(aes(x=as.numeric(quality),y=alcohol,color=density_cat) ,
data=white_wine)+
geom_line(stat="summary",fun.y=median)+
geom_point(alpha=1/4,position="jitter")+
scale_x_discrete(breaks=seq(1,9,1))+
facet_wrap(~chloride_cat)+
scale_color_discrete(name="Density")+
xlab("Quality")+
ggtitle("Quality by (%)Alcohol(summary stat=median),density and chlorides")
Chlorides do not seem to have much impact on quality.
During the bivariate analysis we also observed correlations among residual.sugar, density and alcohol, so let's examine these three together in a multivariate analysis.
# Median alcohol at each density value, one line per residual.sugar category
ggplot(aes(x=density,y=alcohol,color=residual.sugar_cat) ,
data=white_wine)+
geom_line(stat="summary",fun.y=median)+
scale_color_discrete(name="Residual.sugar")+
xlab("Density")+
ggtitle("Density by (%)Alcohol(summary median) and residual.sugar")
ggplot(aes(x=density,y=alcohol,color=residual.sugar_cat) ,
data=white_wine)+
geom_point(alpha=1/4,position="jitter")+
geom_smooth()+
scale_color_discrete(name="Residual.sugar")+
xlab("Density")+
ggtitle("Density by (%)Alcohol and Residual.sugar")
The graph above shows the relationship more clearly. It also shows an outlier, which we had already noticed in the summary statistics. Let's remove it by keeping only the rows below the 99th percentile of residual.sugar.
ggplot(aes(x=density,y=alcohol,color=residual.sugar_cat) ,
data=subset(white_wine,white_wine$residual.sugar<
quantile(white_wine$residual.sugar,0.99)))+
geom_point(alpha=1/10,position="jitter")+
geom_smooth()+
scale_color_discrete(name="Residual.sugar")+
xlab("Density")+
ggtitle("Density by (%)Alcohol and Residual.sugar")
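The same percentile-based subsetting is used several times in this analysis. A minimal sketch of it as a reusable function follows; `trim_upper` is a hypothetical helper name and the data frame is synthetic, with one extreme value planted for illustration.

```r
# Hypothetical helper (not part of the original analysis): keep only rows
# whose value in `col` is strictly below the p-th quantile of that column.
trim_upper <- function(df, col, p = 0.99) {
  cutoff <- quantile(df[[col]], probs = p, na.rm = TRUE)
  df[!is.na(df[[col]]) & df[[col]] < cutoff, , drop = FALSE]
}

# Synthetic data: 199 ordinary values plus one planted extreme value
toy <- data.frame(residual.sugar = c(seq(1, 20, length.out = 199), 60))
trimmed <- trim_upper(toy, "residual.sugar")

nrow(toy) - nrow(trimmed)      # number of rows dropped
max(trimmed$residual.sugar)    # largest value that survives the trim
```

Note that a 99th-percentile cutoff removes roughly the top 1% of rows, not just the single most extreme point, so some ordinary observations are dropped along with the outlier.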
This graph helps us see the negative correlation more clearly.
We will build our model using the density, chlorides, volatile.acidity, residual.sugar and alcohol features.
# Fit a model with alcohol as the input variable and quality as the target variable
m1 <- lm(as.numeric(quality)~alcohol,
data=subset(white_wine,white_wine$alcohol<
quantile(white_wine$alcohol,0.99)))
# Update the model by adding the density feature
m2 <- update(m1, ~ . + density)
# Update the model by adding the chlorides feature
m3 <- update(m2, ~. + chlorides)
# Update the model by adding the volatile.acidity feature
m4 <- update(m3, ~. + volatile.acidity)
# Update the model by adding the log10(residual.sugar) feature
m5 <- update(m4, ~. + log10(residual.sugar))
# Print a table of the models for comparative analysis
mtable(m1, m2, m3, m4, m5)
##
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = subset(white_wine,
## white_wine$alcohol < quantile(white_wine$alcohol, 0.99)))
## m2: lm(formula = as.numeric(quality) ~ alcohol + density, data = subset(white_wine,
## white_wine$alcohol < quantile(white_wine$alcohol, 0.99)))
## m3: lm(formula = as.numeric(quality) ~ alcohol + density + chlorides,
## data = subset(white_wine, white_wine$alcohol < quantile(white_wine$alcohol,
## 0.99)))
## m4: lm(formula = as.numeric(quality) ~ alcohol + density + chlorides +
## volatile.acidity, data = subset(white_wine, white_wine$alcohol <
## quantile(white_wine$alcohol, 0.99)))
## m5: lm(formula = as.numeric(quality) ~ alcohol + density + chlorides +
## volatile.acidity + log10(residual.sugar), data = subset(white_wine,
## white_wine$alcohol < quantile(white_wine$alcohol, 0.99)))
##
## =============================================================================
## m1 m2 m3 m4 m5
## -----------------------------------------------------------------------------
## (Intercept) 0.547*** -25.686*** -24.265*** -37.761*** 37.554***
## (0.102) (6.196) (6.194) (6.036) (9.400)
## alcohol 0.317*** 0.367*** 0.349*** 0.388*** 0.312***
## (0.010) (0.015) (0.016) (0.015) (0.017)
## density 25.864*** 24.731*** 38.424*** -36.868***
## (6.108) (6.103) (5.950) (9.345)
## chlorides -2.358*** -1.304* -0.837
## (0.561) (0.546) (0.542)
## volatile.acidity -2.059*** -2.125***
## (0.113) (0.112)
## log10(residual.sugar) 0.495***
## (0.048)
## -----------------------------------------------------------------------------
## R-squared 0.182 0.185 0.188 0.241 0.257
## adj. R-squared 0.182 0.185 0.188 0.240 0.257
## sigma 0.797 0.796 0.794 0.768 0.760
## F 1078.718 550.215 373.959 383.403 335.009
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -5766.395 -5757.439 -5748.620 -5586.807 -5533.531
## Deviance 3073.150 3061.792 3050.648 2853.217 2791.052
## AIC 11538.790 11522.879 11507.241 11185.613 11081.061
## BIC 11558.242 11548.815 11539.661 11224.518 11126.450
## N 4837 4837 4837 4837 4837
## =============================================================================
We see that R-squared for the final model (m5) is about 0.26, which means these features explain only about 26% of the variation in quality. The model is therefore not a strong fit.
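To make the "share of variation explained" reading concrete, R-squared can be recomputed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this on synthetic data (the variables are simulated for illustration and are not the wine data) and checks the result against `summary()`.

```r
# Synthetic data for illustration only
set.seed(7)
n <- 500
alcohol <- runif(n, 8, 14)
quality <- 2 + 0.3 * alcohol + rnorm(n, sd = 0.8)

fit <- lm(quality ~ alcohol)

rss <- sum(residuals(fit)^2)              # variation left unexplained
tss <- sum((quality - mean(quality))^2)   # total variation in the response
r2_manual <- 1 - rss / tss

# Agrees with the value reported by summary(); for a single predictor it
# also equals the squared Pearson correlation
all.equal(r2_manual, summary(fit)$r.squared)
all.equal(r2_manual, cor(quality, alcohol)^2)
```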
We also presumed that citric.acid and total sulfur dioxide might be related to quality, even though our analysis so far has shown little or no relationship. Let's build a model to check.
# Fit a model with alcohol as the input variable and quality as the target variable
m1_1 <- lm(as.numeric(quality)~alcohol,
data=subset(white_wine,white_wine$alcohol<
quantile(white_wine$alcohol,0.99)))
# Update the model by adding the citric.acid feature
m1_2 <- update(m1_1, ~ . + citric.acid,
data=subset(white_wine,white_wine$citric.acid<
quantile(white_wine$citric.acid,0.99)))
# Update the model by adding the bound_SO2 feature
m1_3 <- update(m1_2, ~. + bound_SO2,
data=subset(white_wine,white_wine$bound_SO2<
quantile(white_wine$bound_SO2,0.99)))
# Get the summary for the final model
summary(m1_3)
##
## Call:
## lm(formula = as.numeric(quality) ~ alcohol + citric.acid + bound_SO2,
## data = subset(white_wine, white_wine$bound_SO2 < quantile(white_wine$bound_SO2,
## 0.99)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6024 -0.5205 -0.0155 0.4904 3.1249
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6798325 0.1350212 5.035 4.95e-07 ***
## alcohol 0.3053997 0.0103170 29.602 < 2e-16 ***
## citric.acid 0.2146354 0.0956810 2.243 0.0249 *
## bound_SO2 -0.0008066 0.0003847 -2.097 0.0361 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7968 on 4844 degrees of freedom
## Multiple R-squared: 0.1908, Adjusted R-squared: 0.1903
## F-statistic: 380.6 on 3 and 4844 DF, p-value: < 2.2e-16
We can see that our previous model (m5, R-squared 0.257) fits better than this one (R-squared 0.191).
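Both models pass `as.numeric(quality)` as the response. This section does not show how `quality` is stored; if it is a factor (an assumption), `as.numeric()` returns the internal level codes rather than the scores themselves. The sketch below, on synthetic scores, shows the difference and its effect on a fitted line: when the levels are consecutive integers, only the intercept shifts and the slope is unchanged.

```r
# A factor of scores: as.numeric() gives level codes, not the scores
q_small <- factor(c(3, 5, 6, 6, 9))
as.numeric(q_small)                 # 1 2 3 3 4
as.numeric(as.character(q_small))   # 3 5 6 6 9

# Synthetic example with consecutive scores 3..9 (illustration only)
set.seed(3)
score <- sample(3:9, 200, replace = TRUE)
x <- score + rnorm(200)
q <- factor(score)

fit_codes  <- lm(as.numeric(q) ~ x)                 # response is 1..7
fit_scores <- lm(as.numeric(as.character(q)) ~ x)   # response is 3..9

# Difference in coefficients: the intercept moves by the offset between
# codes and scores, the slope stays the same
coef_diff <- coef(fit_scores) - coef(fit_codes)
coef_diff
```

If that assumption holds for `white_wine`, converting with `as.numeric(as.character(quality))` before fitting would put the intercept and the predictions on the original quality scale.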
I observed that the variables density, chlorides, volatile.acidity, residual.sugar and alcohol are at least somewhat correlated with quality. I also observed the correlation between alcohol, density and residual.sugar.
In this section, I was able to find some new relationships among quality, total sulfur dioxide and pH.
I observed the relationship between alcohol, density and residual.sugar. I also analyzed the relationship between quality, alcohol, citric.acid and bound_SO2 and found that they are very weakly correlated. This is different from what I initially expected after reading the descriptions of these variables.
I created a linear model with the input features density, chlorides, volatile.acidity, residual.sugar and alcohol. Its R-squared is about 0.26, which means these features explain only about 26% of the variation in quality. The relationships among these variables do not appear to be linear, so a linear model does not explain them very well.
ggplot(aes(y=alcohol,x=density),
data=subset(white_wine,white_wine$density<
quantile(white_wine$density,0.99)))+
geom_line(colour="orange",stat="summary",fun.y=median)+
geom_smooth()+
ylab("% Alcohol")+
xlab("Density(g / cm^3)")+
ggtitle("(%)Alcohol by Density(summary median and 99 quantile) ")
From the graph it is clear that alcohol and density are negatively correlated: as the alcohol level increases, density tends to decrease. As we saw earlier, this negative correlation is also reflected in the correlation coefficient of -0.78. This plot helped me identify the relationship between alcohol and density, which in turn led me to the relationship between alcohol, density and residual.sugar. It also gave me a better understanding of the final features for the model I was going to create.
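A correlation coefficient like the one quoted above comes from `cor()`, and `cor.test()` adds a significance test and confidence interval. The sketch below runs both on synthetic values that were simulated to be negatively related; the numbers are illustrative and are not the wine measurements.

```r
# Synthetic data with a built-in negative relationship (illustration only)
set.seed(11)
density <- rnorm(200, mean = 0.994, sd = 0.003)
alcohol <- 10.5 - 300 * (density - 0.994) + rnorm(200, sd = 0.7)

r  <- cor(alcohol, density)        # Pearson correlation coefficient
ct <- cor.test(alcohol, density)   # same estimate plus test and interval

r
ct$conf.int
```

On the real data the equivalent call would be `cor(white_wine$alcohol, white_wine$density)`.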
ggplot(aes(y=alcohol,x=as.numeric(quality)) ,
data=subset(white_wine,white_wine$alcohol<
quantile(white_wine$alcohol,0.99)))+
geom_line(stat="summary",fun.y=median)+
geom_point(alpha=1/5 , position="jitter")+
scale_x_discrete(breaks=seq(1,9,1))+
xlab("Quality")+
ylab("% Alcohol")+
ggtitle("Quality by (%)Alcohol(summary median and 99 quantile) ")
This plot shows that quality is positively correlated with alcohol. In this dataset, wines with higher alcohol levels tend to receive higher quality scores. The relationship is also evident from the correlation coefficient between the two variables, which is 0.436.
This plot led me to one of my most important findings: the relationship between alcohol and quality.
ggplot(aes(x=as.numeric(quality),y=alcohol,color=density_cat) ,
data=subset(white_wine,white_wine$alcohol<
quantile(white_wine$alcohol,0.99)))+
geom_point(alpha=1/5,position="jitter")+
geom_line(stat="summary",fun.y=median)+
scale_x_discrete(breaks=seq(1,9,1))+
xlab("Quality")+
ylab("% Alcohol")+
scale_color_discrete(name="Density")+
ggtitle("Quality by % Alcohol(summary median) and density")
We see that quality is positively correlated with alcohol: as the alcohol content increases, quality tends to improve. We also see the effect of density on quality alongside alcohol. Density has a correlation coefficient of -0.30 with quality and is negatively correlated with alcohol, with a coefficient of -0.78. Quality therefore tends to be higher for wines with low density and high % alcohol by volume, even though the correlations are not strong. These plots showed me the relationship between density, alcohol and quality and helped me identify the features for the model.
This project helped me gain insight into various ggplot techniques. By choosing a data set that was completely unknown to me, I was able to build a good understanding of the variables and the relationships among them. During my analysis, I learned how useful insights can be derived even when the data does not come with categorical variables. It was very helpful to understand the techniques and methods needed to build the final model. I understand that the final model needs further work, as it is not a strong model, but this exercise laid the foundation for me to work on any data set in the future. I came across various challenges during the analysis. At one point I felt lost, so I sat down with pen and paper to work out what I was trying to achieve and how I could explore the various relationships. My major hurdle was breaking the variables into categories: at first I was unsure how to do the multivariate analysis, but after revising the course notes I got the hint to cut the variables and analyze them by category. In the end, I am very satisfied that this data set gave me so many ways to enhance my understanding and skills. To devise a stronger model, I need to learn more about the individual variables and about modeling techniques, so that I know which models can be applied in which situations. I have also understood that domain knowledge is very helpful in deriving insights from any data set.